Abstract —Recent research shows that deep neural networks
(DNNs) can be used to extract deep speaker vectors (d-vectors)
that preserve speaker characteristics and can be used in speaker
verification. This new method has been tested on text-dependent
speaker verification tasks, and improvement was reported when
combined with the conventional i-vector method.

This paper extends the d-vector approach to semi text-independent
speaker verification tasks, i.e., tasks in which the text of the
speech is restricted to a limited set of short phrases. We explore
various settings of the DNN structure used for d-vector extraction,
and present a phone-dependent training method that employs the
posterior features obtained from an ASR system. The experimental
results show that it is possible to apply d-vectors to semi
text-independent speaker recognition, and that the phone-dependent
training improves system performance.

Index Terms —deep neural networks, speaker vector, speaker 
verification 

I. Introduction 

SPEAKER verification, also known as voiceprint verification,
is an important biometric authentication technique
that has been widely used to verify speakers’ identities.
According to the text that speakers are allowed to say in
enrollment and test, speaker verification systems can be
categorized as either text-dependent or text-independent. While a
text-dependent system requires the same words/sentences to be
spoken in enrollment and test, a text-independent system
permits any words to be spoken. This paper focuses on a semi
text-independent scenario where the words for enrollment and
test are constrained to a limited set of short phrases, e.g., ‘turn
on the radio’. With this limitation, people can speak different
sentences in enrollment and test while the system performance
does not deteriorate significantly, which makes the system more
acceptable in practice.

Most of the successful approaches to speaker verification are
based on generative models with unsupervised learning,
e.g., the well-known Gaussian mixture model-universal background
model (GMM-UBM) framework [?]. A number of advanced
models have been proposed based on the GMM-UBM architecture,
among which the i-vector model [?] is perhaps the
most successful. Despite their impressive success, the GMM-UBM
model and the subsequent i-vector model share the
intrinsic disadvantage of all unsupervised learning methods:

This work was supported by the National Natural Science Foundation of
China under Grant No. 61371136 and No. 61271389, and by the National
Basic Research Program (973 Program) of China under Grant
No. 2013CB329302. The authors are with the Division of Technical Innovation
and Development of Tsinghua National Laboratory for Information Science
and Technology and the Research Institute of Information Technology (RIIT) of
Tsinghua University. This paper is also supported by Sinovoice and Pachira.
(Corresponding e-mail: fzheng@tsinghua.edu.cn)


the goal of the model training is to describe the distributions 
of the acoustic features, instead of discriminating speakers. 

This problem can be addressed in two directions. The first
direction is to employ various discriminative models to enhance
the generative framework, for example the SVM model
for GMM-UBMs [?] and the PLDA model for i-vectors [?].
All these approaches provide significant improvement over the
baseline. Another direction is to look for more discriminative
features, i.e., features that are more sensitive to speaker
change and largely invariant to changes in other, irrelevant
factors, such as phone content and channel [?]. However,
the improvement obtained by this ‘feature engineering’ is much
less significant than that obtained by
discriminative models such as SVM and PLDA. A possible
reason is that most of the features are human-crafted and thus
tend to be suboptimal in practical usage.

Recent research on deep learning offers a new idea of
‘feature learning’. It has been shown that with a deep neural
network (DNN), task-oriented features can be learned layer
by layer from very raw features. For example, in automatic
speech recognition (ASR), phone-discriminative features can
be learned from the spectrum or filter bank energies (Fbanks).
The learned features are very powerful and have outperformed
the conventional features based on Mel frequency cepstral
coefficients (MFCCs) that have dominated ASR for several
decades [?].

This favorable property of DNNs in learning task-oriented
features can be utilized to learn speaker-discriminative features
as well. A recent study shows that this is possible, at least
in text-dependent tasks [?]. The authors constructed a DNN
model and set the training objective to discriminate a set
of speakers; for each frame, the speaker-discriminative
features were read from the activations of the last hidden
layer. They tested the method on a small-footprint text-dependent
speaker verification task (only a short phrase, ‘ok, google’).
The experimental results showed that reasonable performance
can be achieved with the DNN-based features, although it is
still difficult to compete with the i-vector baseline.

In this paper, we extend the application of the DNN-based 
feature learning approach to semi text-independent tasks, and 
present a phone-dependent training which involves phone 
posteriors obtained from an ASR system in the training. The 
experimental results show that the DNN-based feature learning 
works well on text-independent tasks, actually even better 
than on text-dependent tasks, and the phone-dependent training 
offers marginal but consistent gains. 

The rest of this paper is organized as follows. Section II
describes the related work, and Section III presents the DNN-based
speaker feature learning. The experiments are presented
in Section IV, and Section V concludes the paper.


II. Related work 

This paper follows the work in [?]. The difference is that
we extend the application of the DNN-based feature learning
approach to semi text-independent tasks, and we introduce a
phone-dependent training. Due to the mismatched content of
the enrollment and test speech, our task is more challenging.

The DNN model has been employed in speaker verification
in other ways. For example, in [?], DNNs trained for ASR
were used to replace the UBM model to derive the acoustic
statistics for i-vector model training. In [?], a DNN was used
to replace PLDA to improve the discriminative capability of
i-vectors. All these methods rely on the generative framework,
i.e., the i-vector model. The DNN-based feature learning
presented in this paper is purely discriminative, without any
generative model involved.


III. DNN-based feature learning

This section presents the DNN-based feature learning. We
first describe the main structure of the model and the learning
process, and then propose the phone-dependent learning. Finally,
the difference between the i-vector approach and the DNN-based
approach is discussed.


A. DNN-based feature extraction 

It is well-known that DNNs can learn task-oriented features
from raw features layer by layer. This property has been
employed in ASR, where phone-discriminative features are
learned from very low-level features such as Fbanks or even
the spectrum [?]. It has been shown that with a well-trained
DNN, variations irrelevant to the learning task are gradually
eliminated as the input feature is propagated through the
DNN structure layer by layer. This feature learning is so
powerful that in ASR, the primary Fbank feature has outperformed
the MFCC feature that was carefully designed by people and
has dominated ASR for several decades.

This property can also be employed to learn speaker-discriminative
features. In fact, researchers have put much
effort into looking for features that are more discriminative for
speakers [?], but the effort has been mostly in vain and the MFCC
is still the most popular choice. The success of DNNs in ASR
suggests a new direction: speaker-discriminative features
can be learned from data instead of being crafted by hand. The
learning can be done easily and the process is similar to that
in ASR, with the only difference that in speaker verification,
the learning goal is to discriminate between different speakers.

Fig. 1: The DNN structure used for learning speaker-discriminative
features.

Fig. 1 presents the DNN structure used for the speaker-discriminative
feature learning. Following the convention of
ASR, the input layer involves a window of 40-dimensional
Fbanks. In this work, the window size is set to 21, which was
found to be optimal in our experiments. There are 4 hidden layers,
each consisting of 200 units. The units of the output layer
correspond to the speakers in the training data, and the number
is 80 in our experiment. The one-hot encoding scheme is used
to label the target, and the training criterion is cross
entropy. The learning rate is set to 0.008 at the beginning,
and is halved whenever no improvement on a cross-validation
(CV) set is found. The training process stops when the learning
rate becomes too small and the improvement on the CV set is too
marginal.
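The architecture and learning-rate schedule described above can be sketched in NumPy as follows. This is only an illustrative sketch: the layer sizes, the initial learning rate and the halving rule come from the text, while the sigmoid activation, the weight initialisation, the reading of the window size as 21 frames, and all function names are assumptions made for the example. The stopping thresholds are not specified in the text and are therefore left out.

```python
import numpy as np

N_FBANK = 40          # Fbank dimension per frame (from the text)
WINDOW = 21           # window size (from the text; assumed to mean frames)
N_HIDDEN_LAYERS = 4   # from the text
HIDDEN_UNITS = 200    # from the text
N_SPEAKERS = 80       # output units = training speakers (from the text)
INITIAL_LR = 0.008    # from the text


def init_params(rng):
    """Random initialisation; the scheme itself is an assumption."""
    sizes = [N_FBANK * WINDOW] + [HIDDEN_UNITS] * N_HIDDEN_LAYERS + [N_SPEAKERS]
    return [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """Return (last hidden activations, speaker posteriors) for a batch x."""
    h = x
    for w, b in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ w + b)))    # sigmoid units (assumed)
    w, b = params[-1]
    logits = h @ w + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)             # softmax over speakers
    return h, p


def cross_entropy(p, speaker_ids):
    """Mean cross entropy against one-hot speaker targets."""
    return -np.mean(np.log(p[np.arange(len(speaker_ids)), speaker_ids]))


def next_learning_rate(lr, cv_loss, best_cv_loss):
    """Halve the learning rate when the CV loss did not improve."""
    return lr if cv_loss < best_cv_loss else lr / 2.0
```

With these sizes the input layer has 40 x 21 = 840 units, and the 200-dimensional activations returned as the first output of `forward` are the frame-level features from which d-vectors are later formed.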

Once the DNN has been trained successfully, the speaker-discriminative
features can be read from any hidden layer.
The closer a layer is to the output, the more speaker-discriminative
its features are. Our experiments show that features
extracted from the last hidden layer perform the best, which
is similar to the observation in [?].

In the test phase, the features are extracted for all the frames
of the given utterance, and the features are averaged to form a
speaker vector. Following the nomenclature in [?], we call this
speaker vector a ‘d-vector’. Similar to i-vectors, a d-vector
represents the speaker identity of an utterance in the speaker
space. The same methods used for i-vectors can be used for
d-vectors to conduct the test, for example by computing the
cosine distance or applying PLDA.
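The averaging and cosine scoring steps above can be sketched as follows. The function names are hypothetical, and the frame-level features are assumed to be available as a (frames x dimension) array. The score is written as a cosine similarity, so a higher value means the two utterances are more likely to come from the same speaker.

```python
import numpy as np


def d_vector(frame_features):
    """Average frame-level features (frames x dim) into one utterance vector."""
    return np.asarray(frame_features, dtype=float).mean(axis=0)


def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between an enrollment vector and a test vector."""
    enroll_vec = np.asarray(enroll_vec, dtype=float)
    test_vec = np.asarray(test_vec, dtype=float)
    denom = np.linalg.norm(enroll_vec) * np.linalg.norm(test_vec)
    return float(enroll_vec @ test_vec / denom)
```

A verification decision would then compare `cosine_score` against a threshold tuned on held-out data; the threshold choice is outside the scope of this sketch.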

B. Phone-dependent training 

A potential problem of the DNN-based speaker- 
discriminative feature learning described in the previous 
section is that it is a ‘blind learning’, i.e., the features are 
learned from raw data without any prior information. This 
means that the learning purely relies on the complex deep 
structure of the DNN model and a large amount of data to 
discover speaker-discriminative patterns. If the training data is 
abundant, this is often not a problem; however, in tasks with a
limited amount of data, for instance the semi text-independent
task at hand, this blind learning tends to be difficult
because there are too many speaker-irrelevant variations 
involved in the raw data, particularly phone contents. 

A possible solution is to inform the DNN which phone the 
current input frame belongs to. This can be simply achieved 
by adding a phone indicator in the DNN input. However, it 
is often not easy to get the phone alignment for the speech 
data. An alternative to the phone indicator is a vector of phone 
posterior probabilities, which can be easily obtained from any 
phone discriminant model. In this work, we choose a DNN 
model that was trained for an ASR system to produce the 
phone posteriors. Fig. 2 illustrates the DNN structure with
the phone posterior vector involved in the input. The training 
process for the new structure does not change. 
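A minimal sketch of how the augmented input could be assembled is given below. It assumes that the phone posterior vector of the current frame is simply concatenated to the flattened Fbank window; the text does not state how the two parts are arranged, so this layout and the function name are illustrative assumptions. The 67-dimensional posterior size corresponds to the phone set described in Section IV-D (66 initials and finals plus silence).

```python
import numpy as np

N_FBANK = 40    # Fbank dimension per frame (from the text)
WINDOW = 21     # window size (from the text; assumed to mean frames)
N_PHONES = 67   # 66 initials/finals plus silence (Section IV-D)


def phone_dependent_input(fbank_window, phone_posteriors):
    """Concatenate a flattened Fbank window with a phone posterior vector."""
    fbank_window = np.asarray(fbank_window, dtype=float)
    phone_posteriors = np.asarray(phone_posteriors, dtype=float)
    if fbank_window.shape != (WINDOW, N_FBANK):
        raise ValueError("expected a (21, 40) Fbank window")
    if phone_posteriors.shape != (N_PHONES,):
        raise ValueError("expected a 67-dimensional posterior vector")
    return np.concatenate([fbank_window.ravel(), phone_posteriors])
```

Under these assumptions the input layer grows from 840 to 907 units, while the hidden layers and the speaker output layer stay unchanged.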




JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014








Fig. 2: The DNN structure used for phone-dependent training. 


We note that this phone-dependent training is more important
for text-independent recognition. For text-dependent
recognition, the acoustic features are limited to a small set of
phones, and so involving the phone information in the training
does not help much.

C. Comparison between i-vectors and d-vectors 

The two kinds of speaker vectors, the d-vector and the i- 
vector, are fundamentally different. I-vectors are based on a 
linear Gaussian model, for which the learning is unsupervised 
and the learning criterion is maximum likelihood on acoustic 
features. In contrast, d-vectors are based on neural networks, 
for which the learning is supervised, and the learning criterion 
is maximum discrimination for speakers. This difference in
model structures and learning methods leads to significantly
different properties of these two vectors.
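For reference, the linear Gaussian model behind i-vectors is commonly written in the following form. The notation is the conventional one from the i-vector literature and is not taken from this paper: M is the utterance-dependent GMM mean supervector, m is the UBM mean supervector, T is the low-rank total variability matrix (the "T matrix" mentioned in Section IV), and w is the latent variable whose posterior mean is used as the i-vector.

```latex
\[
  M = m + T w, \qquad w \sim \mathcal{N}(0, I)
\]
```

The parameters m and T are estimated by maximum likelihood on the acoustic features without speaker labels, which is the unsupervised criterion contrasted with d-vector training above.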

First, the i-vector is ‘descriptive’: it represents the
speaker by constructing a GMM (derived from the i-vector) to
fit the acoustic features. In contrast, the d-vector is
‘discriminative’: it represents the speaker by removing
speaker-irrelevant variance.

Second, the i-vector can be regarded as a ‘global’ speaker
description, which is inferred from ‘all’ the frames of an
utterance; the d-vector, however, is a ‘local’ description, which
is inferred from ‘each’ frame, and only the context information
is used in the inference. This means that the d-vector tends
to be superior with a short utterance, while the i-vector
tends to perform better with a relatively long utterance.

Third, the i-vector approach relies more on the enrollment
data to form a reasonable distribution that can be
used to discriminate different speakers, whereas the d-vector
approach relies more on the ‘universal’ data to learn
speaker-discriminative features. This means that a large amount of
training data (labelled with speakers) is much more important
and useful for the d-vector approach.

IV. Experiments 


The training set involves 80 randomly selected speakers,
which results in 12000 utterances in total. To prevent overfitting,
a cross-validation (CV) set containing 1000 utterances
is selected from the training data, and the remaining 11000
utterances are used for model training, including the DNN
model in the d-vector approach, and the UBM, the T matrix,
and the LDA and PLDA models in the i-vector approach.

The evaluation set consists of the remaining 20 speakers. 
In the text-dependent experiment, the evaluation is performed 
for each particular phrase; and in the semi text-independent 
experiment, all the utterances in the evaluation set (3000 in 
total) are cross evaluated, resulting in 223500 target trials and 
4275000 non-target trials. 
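The trial counts above can be checked with a short calculation. The sketch assumes that every unordered pair of evaluation utterances is scored exactly once, and it uses the figure of 150 utterances per speaker (10 phrases recorded 15 times each) given in the database description.

```python
from math import comb

n_speakers = 20            # evaluation speakers (from the text)
utts_per_speaker = 150     # 10 phrases x 15 recordings (from the text)

n_utts = n_speakers * utts_per_speaker
all_pairs = comb(n_utts, 2)                       # every unordered pair
target = n_speakers * comb(utts_per_speaker, 2)   # same-speaker pairs
non_target = all_pairs - target                   # different-speaker pairs

print(n_utts, target, non_target)  # 3000 223500 4275000
```

Both totals match the numbers reported in the text, which supports the reading that trials are unordered utterance pairs.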


B. Text-dependent recognition 

The first experiment investigates the performance of the d-vector
approach on text-dependent speaker verification tasks
and compares it to the i-vector baseline. Similar work has
been reported in [?]; here we reproduce that work and
propose some improvement by leveraging text-independent
data.

For clarity, we report the results on two randomly
selected phrases, denoted by P1 and P2 respectively. For
each phrase, the corresponding utterances are selected from
the training set to train the i-vector system and the d-vector
system respectively, and the corresponding utterances in the
evaluation set are selected to perform the test. This means that
the training data for each phrase consists of 1200 utterances,
and the test data consists of 300 utterances. For the i-vector
system, the number of Gaussian mixtures of the UBM is 64, and the
i-vector dimension is 200. These values have been chosen
to optimize the performance. The DNN architecture for the
d-vector system has been shown in Section III. For a fair
comparison, the dimension of the d-vector is set to 200 as
well.

The tests are based on three scoring methods: the basic
cosine distance, the cosine distance after reducing the dimension
to 80 by LDA, and the score provided by PLDA. Table I
reports the results in terms of equal error rate (EER). It can be
seen that the d-vector system obtains reasonable performance;
however, the results are much worse than those with the i-vector
system. Similar observations have been reported in [?].
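The EER is the operating point at which the false acceptance rate equals the false rejection rate. A simple way to approximate it from trial scores is sketched below. The threshold sweep over all observed scores is an illustrative choice and not necessarily the procedure used to produce the reported numbers; the function name is hypothetical, and higher scores are assumed to indicate a target trial.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # False rejection: target trials scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance: non-target trials scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])
```

The loop is quadratic in the number of trials, so for trial lists with millions of entries a sorted cumulative-count implementation would be preferable.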


TABLE I: EER results on text-dependent task

                      EER%
System     Phrase   Cosine   LDA     PLDA
i-vector   P1       4.91     4.62    4.05
d-vector   P1       12.05    9.52    10.76
i-vector   P2       3.86     3.10    2.76
d-vector   P2       8.86     7.00    8.90


A. Database 

The experiments are performed on a short phrase speech 
database provided by Pachira. The entire database contains 
recordings of 10 short phrases from 100 speakers (gender 
balanced), and each phrase contains 2 ~ 5 Chinese characters. 
For each speaker, every phrase is recorded 15 times, amounting 
to 150 utterances per speaker. 


As discussed in Section III, the DNN model of the d-vector
system can be enhanced by borrowing data from text-independent
tasks. The results are reported in Table II. It can
be observed that with more training data, the performance of
d-vector systems is generally improved, even though the extra
data are recordings of other phrases. Another observation is
that with more training data, the PLDA model tends to be less
effective. This can possibly be explained by the fact that d-vectors
are derived from activations of neural network units
and so probably do not fit the linear Gaussian model that
PLDA assumes.

Fig. 3: The DNN structure with non-linear dimension reduction.


TABLE II: EER results with additional data

                             EER%
Phrase   Training          Cosine   LDA     PLDA
P1       P1                12.05    9.52    10.76
P1       P1,P2             11.57    8.29    10.57
P1       P1,P2,...,P15     11.14    8.14    11.00
P2       P2                8.86     7.00    8.90
P2       P1,P2             7.95     5.81    6.91
P2       P1,P2,...,P15     8.33     5.43    7.95


C. Semi text-independent recognition 

This experiment examines the d-vector approach on the semi
text-independent task. The dimension of both i-vectors and
d-vectors is fixed to 200, and the dimension of the LDA-projected
vectors is 80. In order to have the two systems
involve the same number of parameters, the number of Gaussian
components of the i-vector system is set to 128. All the
utterances in the training dataset are used to train the DNN
model and the i-vector model.

The results of the two systems are reported in Table III.
It can be observed that with the simple cosine distance,
the d-vector system outperforms the i-vector system by a
significant margin. This demonstrates that the discriminatively
learned d-vectors are more discriminative for speakers than
the generatively learned i-vectors. However,
when the discriminative normalization methods (LDA and
PLDA) are employed, the performance of the i-vector system
is significantly improved and becomes better than that of the
d-vector system. The discriminative methods contribute very little
to the d-vector system. This is not surprising, as the d-vectors
are already discriminative.

Nevertheless, the slight improvement with LDA suggests
that there is some redundancy in d-vectors. Motivated by this
idea, a new hidden layer with a small number of units is
inserted into the DNN structure, as shown in Fig. 3. The
dimension of the new layer is set to 100, which is the best
choice in our test. Compared to LDA, this approach can
be regarded as a non-linear dimension reduction (NLDR).
An additional performance gain is achieved with this method, as
shown in the last column of Table III.



Fig. 4: The EER results of the d-vector and i-vector combination
system. The x-axis represents the interpolation weight α.

D. Phone-dependent training 

In this experiment, the phone posteriors are included in the
input of the DNN structure, as shown in Fig. 2. The phone
posteriors are produced by a DNN model that was trained
for ASR with a Chinese database consisting of 6000 hours
of speech data. The phone set consists of 66 toneless initials
and finals in Chinese, plus the silence phone. The results
are shown in the third row of Table III. It can be seen that
the phone-dependent training leads to marginal but consistent
performance improvement for the d-vector system. The NLDR
approach is also applied, and an additional gain is obtained.

TABLE III: EER results on semi text-independent task

System     PDTR   Cosine   LDA     PLDA    NLDR
i-vector   -      19.32    11.09   8.70    -
d-vector   -      13.58    13.07   15.45   12.79
d-vector   +      13.21    12.76   15.48   12.55


E. Combination system 

Following [?], we combine the best i-vector system (PLDA)
and the best d-vector system (NLDR with phone-dependent
training). The combination is simply done by interpolating the
scores obtained from the two systems. The EER results with
various values of the interpolation factor (denoted by α) are
drawn in Fig. 4. It can be seen that the combination leads to
better performance with an appropriate setting of α. The best
EER is 7.14%, which is the lowest EER we can obtain with
this dataset so far.
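The score interpolation can be sketched as a convex combination of the two systems' scores. Which system receives the weight α is not stated in the text, so assigning α to the d-vector score is an assumption, as is the function name. The two score types may also lie on different scales, and the text does not say whether they were normalized before combination.

```python
import numpy as np


def fuse_scores(d_scores, i_scores, alpha):
    """Linear interpolation: alpha * d-vector score + (1 - alpha) * i-vector score."""
    d_scores = np.asarray(d_scores, dtype=float)
    i_scores = np.asarray(i_scores, dtype=float)
    return alpha * d_scores + (1.0 - alpha) * i_scores
```

Sweeping `alpha` over a grid between 0 and 1 and evaluating the EER of the fused scores at each value would produce a curve of the kind described for Fig. 4.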

V. Conclusions 

This paper investigated DNN-based discriminative feature
learning for speaker recognition, and studied the performance
of this approach on a semi text-independent task. The experimental
results demonstrated that the DNN-based approach
can offer reasonable performance, and that it outperformed the
i-vector baseline with simple cosine distance. However, when
discriminative normalization methods such as LDA and PLDA
are applied, the i-vector approach exhibits better performance.

Although it has not yet beaten the i-vector approach,
the d-vector approach is very promising and can potentially be
improved in several ways. In particular, a powerful probabilistic
model on d-vectors would deal with inter-frame uncertainty
and so may considerably enhance system performance. We
leave this as future work.